1. INTRODUCTION

In today's data-saturated environment, organizations typically accumulate large volumes of
customer feedback with little categorization. It is difficult for people to examine this feedback manually
without bias or inaccuracy. Speech emotion recognition (SER) addresses this problem. SER takes audio input
of human speech, converts it to text, evaluates the text to determine whether the sentiment of the statement
is positive, negative, or neutral, and then reports the analyzed emotion. Providing a substantial body of
structured data, rather than relying on human intuition alone, which is not always accurate, helps improve
decision-making in a wide range of sectors. SER can be implemented using many different approaches, such as
a lexicon-based method, a deep learning strategy, or a hybrid technique. Here, human emotions in speech are
recognized using a machine learning approach.

In this paper, SER using various machine learning techniques (MLT) is tested on two different
datasets. The datasets consist of voice and text data. Tweets from Twitter form part of the considered
datasets. This paper is organized into five sections. The first is the introduction. The second section
discusses the related work. The third section explains the implementation. The fourth section presents the
findings, and the final section summarizes them.


2. RELATED WORK 
Emotions in speech are recognized using deep learning (DL) and an attention mechanism based on a
deep neural network (DNN) in [1]. That work provides an overview of the most recent developments in SER and
examines how various attention mechanisms affect SER performance. SER using mel frequency cepstral
coefficients (MFCC), mel spectrogram (MS), and chroma features is discussed in [2]. The authors used the
librosa package (Python) to extract the information required for SER. A study of various speech emotion
detection techniques, including the hidden Markov model (HMM), support vector machine (SVM), and others, is
presented in [3]. The voice recognition system contains four key components, namely speech input, feature
extraction, SVM-based classification, and emotion output, just like other conventional systems. The three
steps in SER are as follows:

- Identification of signals in pitch/energy. This task is performed by the Language Development System.
- Feature reduction, i.e., the quantities are encapsulated in fewer features.
- Mapping the characteristics to emotions, i.e., using the sample data, the characteristics are mapped to
  the emotions.


Figure 1. Speech emotion recognition system


Figure 1 depicts the fundamental structure of the human emotion recognition system. A speech signal
supplies the input. The input signal is pre-processed to reduce noise. Pre-emphasis and spatial and
temporal filtering are used for pre-processing. Next, the common features are identified using feature
extraction algorithms. Pitch, intensity, and speech rate are considered as the features of the speech signal.
Finally, classifiers are employed to categorize the input signal, based on its features, into emotional
states (happy, sad, joy, angry, surprise). Artificial neural networks (ANN), Gaussian mixture models,
K-nearest neighbor, HMM, vector quantization, and SVMs are the most popular classification algorithms.
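
As an illustration of the pre-processing stage, pre-emphasis is commonly realized as a first-order
high-pass filter, y[n] = x[n] - a*x[n-1]. The minimal sketch below assumes this standard form. The
coefficient 0.97 and the function name `pre_emphasis` are illustrative assumptions and are not values
reported in this paper.

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply a first-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].

    coeff=0.97 is a commonly used default, assumed here for illustration only.
    """
    if not signal:
        return []
    # The first sample has no predecessor, so it is passed through unchanged.
    out = [float(signal[0])]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


if __name__ == "__main__":
    # A constant (low-frequency) signal is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

The filter boosts rapid changes relative to slowly varying content, which is why it is often applied
before feature extraction.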

In [4], the author discusses what emotions are and the related theories, covering six different
theories from the earliest to the present day. Emotion theories can be grouped into three categories:
physiological, neurological, and cognitive [5]. Identifying the sentiments of a person based on his or her
communication is known as sentiment analysis [6]. It can be aspect-based or fine-grained. The fundamentals
of sentiment analysis and the associated challenges are discussed in [7], where sentiment analysis is
explained in terms of natural language processing.

Computational models such as Naive Bayes (NB), SVM, and N-gram have been used successfully to
obtain the desired results in sentiment analysis. These models are thoroughly explained in [8]. Various
computational models are analyzed on the social networking platform Twitter in [9], where tweets are
categorized as positive, negative, or neutral. The authors considered millions of tweets from Twitter and
created a database for the task. The NB method [10] and logistic regression (LR) as a text classifier [11]
are discussed, including their methodologies, mathematical foundations, and variants. The linear support
vector classifier (LSVC) for text classification and the confusion matrix function are discussed in [12].
The fit and transform functions that can be used to train models are discussed in [13]. In [14], the authors
discuss K-Means clustering for grouping data and multiple algorithms for emotion recognition, such as LR,
random forest, NB, linear SVC, and decision tree. A systematic evaluation of hidden Markov models for
sentiment analysis is presented in [15]. The lexicon-based methodology for sentiment analysis is discussed
in [16]; it entails gauging sentiment from the text's use of words or phrases with different semantic
connotations. Speech-to-text conversion using the Google API is developed in [17]. That work goes over the
differences between synchronous and asynchronous audio files and uses real-time speech input. Sphinx4,
Google Speech API, and Bing Speech API, with weighted workload ratings (WWR) as the performance metric, can
be used for speech recognition [18]. For speaker recognition, MFCC is used as a feature and dynamic time
warping (DTW) for feature matching, with several distance computation methods such as Euclidean,
correlation, and Canberra, and recognition rate as the performance parameter. HMM and neural networks are
employed for speech-to-text conversion [19]. The IEMOCAP corpus is employed for feature extraction [20]. To
create an emotion recognition model, the author employed the Inception v3 model. Inception is based on the
GoogLeNet architecture.


2.1. Approaches to sentiment analysis

A novel method is proposed to detect fake comments on e-commerce platforms [21]. The method uses
n-grams and the term frequency-inverse document frequency (TF-IDF) approach to extract features. It is
tested on a hotel review dataset and is reported to perform better than another method. Sentiment analysis
plays a vital role in video conferencing. In the era of the pandemic, video conferencing is the main mode
of communication. A COVID-based analysis of video conferencing platforms is presented in [22]. The platforms
are analyzed in four stages, and the analysis is supported by the respective applications. Sentiment
analysis also plays a vital role in social media. A novel blockchain-based algorithm is proposed in [23].
It creates a secure connection to transfer information between sender and receiver. The application of the
proposed methodology is live streaming on social media. The algorithm's performance is compared with other
cutting-edge algorithms and is reported to be superior. The factors affecting the performance of wireless
proxy internet protocol are addressed in [24]. The key exchange protocol is negotiated, and the
experimentation is conducted by considering the safety of the TSL protocol. Evolutionary algorithms also
play a vital role in sentiment analysis and speech recognition. An evolutionary method using the bee
algorithm [25] is proposed to optimize the wireless sensor network. The results are compared with the
genetic algorithm, and the proposed method performed better.


3. METHOD 

The SER system takes human speech as input, transforms it into text, analyzes this text, and finally
determines whether the sentiment of the input statement is positive, negative, or neutral. SER is a process
of recognizing human emotions. Natural language processing and machine learning techniques are the driving
force of SER. The usual source of input in SER is real-time human speech; in the proposed method, an
additional input option, text, is also considered. A two-way input acceptor is therefore implemented. The
input handling can be described as follows:


Case 1.
If Input_signal = <text>
Then
    Apply Machine Learning Algorithm
    Detect emotions
End if

Case 2.
If Input_signal = <Voice>
Then
    Convert Voice into Text
    Apply Machine Learning Algorithm
    Detect emotions
End if
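
The two cases above can be sketched in Python as a single dispatch function. This is a minimal
illustration rather than the implementation used in this work: the function name `detect_emotion` and
the callables `speech_to_text` and `classify` are hypothetical placeholders for the speech-to-text
module and the trained classifier.

```python
def detect_emotion(input_signal, kind, speech_to_text, classify):
    """Route text or voice input to the classifier.

    `speech_to_text` and `classify` are hypothetical callables supplied by the
    caller; they stand in for the speech-to-text module and the trained model.
    """
    if kind == "text":
        text = input_signal                   # Case 1: text is used directly
    elif kind == "voice":
        text = speech_to_text(input_signal)   # Case 2: convert voice to text first
    else:
        raise ValueError(f"unsupported input kind: {kind!r}")
    return classify(text)


if __name__ == "__main__":
    # Stub components, used only to show the control flow.
    fake_stt = lambda audio: "the service was great"
    fake_clf = lambda text: "positive" if "great" in text else "neutral"
    print(detect_emotion(b"raw-audio-bytes", "voice", fake_stt, fake_clf))
```

Passing the two components as arguments keeps the routing logic independent of any particular
speech recognizer or classifier.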


To identify the best machine learning technique (MLT), several MLTs are tested on the given dataset,
namely LR, linear SVC, Valence Aware Dictionary and sEntiment Reasoner (VADER), TextBlob, and Naive
Bayes. As mentioned earlier, the speech signal is converted into text in real time, and further processing
then takes place. Alternatively, the user can provide text directly, which is processed and categorized into
one of the emotion categories. For text classification, a deep learning model is used. As shown in
Figure 2, the model is trained to determine the sentiment of the input text as positive, negative, or
neutral. The following is a detailed description of the proposed approach.

Figure 2. Proposed method


3.1. Speech to text 

In this module, a deep learning model accepts the audio signal and converts the audio into text. The
Google API is used to recognize the audio language. Using the Google API, the speech signals are converted
into electronic signals. The next step is pre-processing, in which the electronic signals are normalized.
The output of the pre-processing is then supplied to the speech-to-text module, which employs a deep
learning computational model. The text's context is provided by the deep learning module. Finally, the
output of the speech-to-text module is passed to AutoML-NLP, which generates the final text.


3.2. Text pre-processing 

Text pre-processing includes data cleaning, stop word removal, tokenization, and
stemming/lemmatization. In data cleaning, symbols and punctuation are removed from the text. All natural
languages have some common words. These common words consume valuable computation time during
pre-processing and generally contribute little meaning to the text, so they can be removed. They are known
as stop words, and suppressing words means removing the stop words from the provided text. The basic unit
of text is the word. Words can be further divided into chunks, which are referred to as tokens in natural
language processing (NLP). Tokens are a useful unit for semantic analysis. The process of splitting the
text into tokens is known as tokenization. Finally, the words are normalized using stemming/lemmatization,
which provides the root words. These root words are used in further processing.
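
The pre-processing steps described above can be sketched with the Python standard library alone. The
stop word list and the suffix-stripping rule below are deliberately tiny, illustrative assumptions; a
practical system would rely on a fuller stop word list and a proper stemmer or lemmatizer, such as those
provided by NLTK.

```python
import re

# Tiny illustrative stop word list (an assumption, not the list used in this work).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "it"}


def clean(text):
    """Lowercase the text and replace symbols and punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run cleaning, tokenization, stop word removal, and stemming in order."""
    tokens = remove_stop_words(tokenize(clean(text)))
    return [crude_stem(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("The delivery was delayed, and it is annoying!"))
```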


3.3. Feature extraction 

In this module, the features are obtained from the text. The features are extracted using the
bag-of-words method. In the bag-of-words method, the words are searched for in the text and tagged to the
corresponding class. The features of the text are thus extracted via text vectorization.
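
A minimal sketch of bag-of-words vectorization is given below, assuming each document has already been
reduced to a list of tokens. The helper names `build_vocabulary` and `vectorize` are hypothetical and are
used only for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc})
    return {token: i for i, token in enumerate(vocab)}


def vectorize(doc, vocabulary):
    """Return the count vector of `doc` over `vocabulary`; unknown tokens are ignored."""
    counts = Counter(doc)
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


if __name__ == "__main__":
    docs = [["good", "service"], ["bad", "service", "bad"]]
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(vectorize(doc, vocab))
```

Each document becomes a fixed-length vector of word counts, which is the numerical form a classifier
can consume.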


3.4. Sentiment classification using Naive Bayes classifier 

Classification and prediction are important aspects of ML. NB is a simple and powerful algorithm. It
is a classification technique that relies on the assumption of independent predictors and on Bayes'
theorem. Its name has two parts: the "naive" part and the "Bayes" part. Even if the features depend on one
another or on the existence of other features, the NB classifier assumes that the presence of one feature
in a class has no influence on the presence of any other feature.

The NB classification model is straightforward to construct and is particularly beneficial for very
large datasets. It is grounded in Bayes' theorem from probability theory and statistics, which expresses
the likelihood of an event based on prior knowledge of conditions related to that event. Bayes' theorem uses
conditional probability, which is the likelihood of an event occurring given that one or more other events
have occurred. The proposed method is tested on a dataset of 17K Twitter tweets.
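
Under the independence assumption described above, the posterior probability of a class c for a
document with tokens w_1, ..., w_n is proportional to P(c) multiplied by the product of the P(w_i | c).
The sketch below illustrates this for a multinomial NB model on token lists. Add-one (Laplace) smoothing
and the class name `TinyMultinomialNB` are assumptions made for the example; the paper does not state
the smoothing used.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.word_counts.values() for t in counts}
        return self

    def log_posterior(self, tokens, label):
        """Unnormalized log P(label | tokens) = log P(label) + sum of log P(token | label)."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in tokens:
            if token not in self.vocabulary:
                continue  # tokens never seen in training are skipped
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def predict(self, tokens):
        """Return the class with the highest unnormalized log posterior."""
        return max(self.class_counts, key=lambda c: self.log_posterior(tokens, c))


if __name__ == "__main__":
    docs = [["good", "great"], ["great", "love"], ["bad", "awful"], ["awful", "hate"]]
    labels = ["positive", "positive", "negative", "negative"]
    model = TinyMultinomialNB().fit(docs, labels)
    print(model.predict(["great", "good"]), model.predict(["awful"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.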


3.5. Training and testing 


Training and testing are done using all 17K Twitter tweets. The NB classifier is trained on 55
percent of the dataset and tested on the remaining 45 percent. The advantage of the NB classifier is that
it is extremely fast and requires only one pass through the data.


4. RESULTS AND DISCUSSION 

As discussed earlier, 17K Twitter tweets are considered for training the Naive Bayes classifier. These
17K tweets form the text dataset. Apart from these, other texts, audio files, and real-time voice input from
Kaggle.com are also considered for the SER.


4.1. Data collection 

To implement the SER system, we must first collect input samples, which can be in the form of text,
audio files, or real-time voice input. These samples are used to train the model. WAV or MP3 files are
commonly used as audio input. We used two datasets based on Twitter tweets that included positive,
negative, and neutral reviews.


4.1.1. Analyzing the dataset

The dataset is labeled in three categories: positive, negative, and neutral. The labeling is done on
the basis of the words and their context. NLP with NLTK is used to find the tokens and label the data. The
distribution of the labeled data is displayed in Figure 3.


Figure 3. Dataset analysis (x-axis: label, with categories negative, positive, and neutral)


4.2. Training the model 
Clean, labeled, and normalized data are the characteristics of a good database for speech
emotion recognition. In the proposed method, the sklearn library for Python and its data-splitting
functionality are used for training. The training involves the following steps.
- Split the data into two sets, training and testing, in a ratio of 80:20.
- Extract features from the data, i.e., convert the textual content into numerical form.
- Remove the stop words, i.e., data cleaning.
- Fit the model. Model fitting refers to how well the ML model generalizes from the training data to the
  testing data.

In the proposed method, three classifiers are used, namely the multinomial Naive Bayes (MNB) model, the
LR model, and the LSVM classifier model. The results of these classifiers are discussed in the next
sections.
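
The data-splitting step listed above can be sketched as follows. The sketch uses only the standard
library to show the mechanics; sklearn offers an equivalent `train_test_split` function in
`sklearn.model_selection`. The function name `split_dataset`, the fixed seed, and the toy data are
assumptions made for illustration.

```python
import random


def split_dataset(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the data reproducibly and split it into training and testing parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # seeded for reproducibility
    n_test = round(len(indices) * test_fraction)  # e.g. 20% of the data for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


if __name__ == "__main__":
    texts = [f"tweet {i}" for i in range(10)]
    sentiments = ["positive" if i % 2 else "negative" for i in range(10)]
    x_tr, x_te, y_tr, y_te = split_dataset(texts, sentiments, test_fraction=0.2)
    print(len(x_tr), len(x_te))
```

Shuffling before splitting avoids a test set that reflects only the ordering of the source file.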


4.3. Multinomial Naive Bayes model 

The proposed model employs the MNB model. It has been tested on three classes. As per Figure 4, the
support for all three classes is 34534, which shows that the data is well balanced. Therefore, the model
gives satisfactory results in accuracy, which blends specificity and sensitivity, as well as in recall and
precision. This model achieves an accuracy of 86% for dataset 1, as shown in Figure 4, and 87% for
dataset 2, as shown in Figure 5.


4.4. Logistic regression model
The LR model is tested on two datasets. As per the results obtained, the data is balanced. The model's
accuracy is 95% for both dataset-1 and dataset-2, as illustrated in Figures 6 and 7.


4.5. Linear support vector classifier model

Like the other models, the linear support vector (LSV) classifier model is tested on two datasets. On
dataset-1, the model's accuracy is 96 percent, as shown in Figure 8. On dataset-2, the model's accuracy is
also 96 percent, as shown in Figure 9. A comparison of the aforementioned methods is given in the next
section.


4.6. Comparisons 

All the aforementioned classifiers are trained and tested on two different datasets. The
summary of results is shown in Table 1. All three classifiers are evaluated on four metrics: precision,
recall, F1-score, and accuracy. The results show that the linear support vector machine (LSVM)
outperformed the other two classifiers. The precision, recall, F1-score, and accuracy of the linear SVC are
96% on both datasets, whereas the MNB performed worst in the experiments. The graphical representation of
accuracy is shown in Figure 8 and Figure 9.

Figure 4. Multinomial Naive Bayes on dataset-1
Figure 5. Multinomial Naive Bayes on dataset-2
Figure 6. Logistic regression on dataset-1
Figure 7. Logistic regression on dataset-2

[Figure: Comparison between different models of SER]


Table 1. Comparison table 


Model                              Precision  Recall  F1-score  Accuracy
Multinomial NB on dataset-1        86%        87%     87%       87%
Logistic regression on dataset-1   95%        95%     95%       95%
Linear SVC on dataset-1            96%        96%     96%       96%
Multinomial NB on dataset-2        88%        88%     88%       87%
Logistic regression on dataset-2   95%        95%     95%       95%
Linear SVC on dataset-2            96%        96%     96%       96%
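
For reference, the four metrics reported in Table 1 can be computed from true and predicted labels as
sketched below. The paper does not state how the per-class values are averaged; macro averaging is
assumed here, and the function name `classification_metrics` and the toy labels in the usage example are
illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
        fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    n = len(per_class)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(c[0] for c in per_class) / n,
        "recall": sum(c[1] for c in per_class) / n,
        "f1": sum(c[2] for c in per_class) / n,
    }


if __name__ == "__main__":
    truth = ["positive", "positive", "negative", "negative"]
    predicted = ["positive", "negative", "negative", "negative"]
    print(classification_metrics(truth, predicted))
```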


5. CONCLUSION
An analysis of various computational models that could be used as classifiers in SER is performed.
Three different computational models, namely MNB, LR, and LSVM, are trained and tested on two different
datasets. The datasets consisted of video, audio, and text data. Eighty percent (80%) of the samples are
used for training and twenty percent (20%) for testing. In the experiments, MNB has an average accuracy of
eighty-seven percent (87%), LR has an average accuracy of ninety-five percent (95%), and LSVC has an
average accuracy of ninety-six percent (96%). Hence, LSVC is the most effective of the three models and
provides the highest level of emotion analysis accuracy.